ODESIA Leaderboard
Evaluation of language models in English and Spanish
Objective: to establish a direct comparison between model performance in English and Spanish in order to measure the effectiveness gap.
Method: evaluation on the ODESIA Benchmark, a collection of Natural Language Processing tasks with comparable datasets in English and Spanish.
Objectives
The ODESIA Leaderboard makes it possible to (i) measure the effectiveness gap of language models in Spanish with respect to English, and (ii) comparatively evaluate language models in Spanish. If you have developed a Spanish language model, submit your results!
Results
The average effectiveness gap between Spanish and English is 20%, with a standard error of ±6%. Notably, the gap is wider on the harder tasks (exceeding 200% on the task with the highest intrinsic difficulty), so the average value is only partially representative.
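As a rough illustration of how such a summary statistic can be computed (the exact per-task gap values and aggregation used by ODESIA are not reproduced here; the numbers below are hypothetical), a mean gap with its standard error could be obtained as follows:

```python
import statistics

def gap_summary(gaps_pct):
    """Mean of per-task gap percentages and its standard error
    (sample standard deviation divided by sqrt(n))."""
    mean = statistics.mean(gaps_pct)
    sem = statistics.stdev(gaps_pct) / len(gaps_pct) ** 0.5
    return mean, sem

# Hypothetical per-task gaps (%), for illustration only.
gaps = [17, 10, 48, -4, 19]
mean, sem = gap_summary(gaps)
print(f"average gap: {mean:.0f}% +/- {sem:.0f}%")
```

Because a few very hard tasks can dominate (as noted above), reporting the standard error alongside the mean helps convey how dispersed the per-task gaps are.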
Tasks
Two sets of tasks are used: (i) ODESIA CORE, bilingual tasks with private test data (this avoids contamination, i.e., the models having seen the evaluation labels during pre-training); and (ii) ODESIA EXTENDED, which adds a set of four standard, publicly available bilingual tasks.
Methodology
The ODESIA Leaderboard uses a set of 14 bilingual tasks to compare the state of the art in English and Spanish. For each task, (i) the intrinsic difficulty is estimated by applying several non-linguistic algorithms, and (ii) the best results in each language are calibrated using that intrinsic difficulty.
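Before any difficulty calibration, a naive per-task gap could be sketched as below. Note that the published ODESIA gaps additionally incorporate the intrinsic-difficulty calibration described above, which this sketch deliberately does not model:

```python
def relative_gap(best_es: float, best_en: float) -> float:
    """Naive relative gap (%) of the best Spanish score with respect
    to the best English score. Positive values mean English is ahead."""
    return 100 * (best_en - best_es) / best_en

# For example, best ES = 0.5 and best EN = 1.0 gives a 50% naive gap.
```

This uncalibrated figure will generally differ from the calibrated gaps shown in the tables below, which is expected given the methodology.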
Leaderboard
ODESIA Core Tasks
# | System | Arithmetic mean | EXIST 2022: Sexism detection (ES) | EXIST 2022: Sexism categorisation (ES) | DIPROMATS 2023: Propaganda identification (ES) | DIPROMATS 2023: Coarse propaganda characterization (ES) | DIPROMATS 2023: Fine-grained propaganda characterization (ES) | DIANN 2023: Disability detection (ES) | EXIST-2023: Sexism identification (ES) | EXIST-2023: Source Intention (ES) | EXIST-2023: Sexism categorization (ES) | SQAC-SQUAD 2024: Question answering (ES) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | distilbert-base-multilingual-cased | 0.459 | 0.72 | 0.47 | 0.75 | 0.34 | 0.09 | 0.78 | 0.57 | 0.36 | 0.29 | 0.22 |
2 | distillbert-base-spanish-uncased | 0.473 | 0.72 | 0.51 | 0.77 | 0.34 | 0.07 | 0.75 | 0.60 | 0.39 | 0.33 | 0.25 |
3 | xlm-roberta-base | 0.515 | 0.74 | 0.50 | 0.79 | 0.47 | 0.10 | 0.84 | 0.62 | 0.40 | 0.32 | 0.37 |
4 | ixambert-base-cased | 0.485 | 0.71 | 0.49 | 0.77 | 0.32 | 0.06 | 0.83 | 0.60 | 0.37 | 0.34 | 0.36 |
5 | bert-base-multilingual-cased | 0.488 | 0.72 | 0.47 | 0.78 | 0.35 | 0.10 | 0.84 | 0.60 | 0.37 | 0.33 | 0.32 |
6 | bert-base-spanish-wwm-cased | 0.524 | 0.72 | 0.54 | 0.79 | 0.44 | 0.14 | 0.81 | 0.63 | 0.39 | 0.37 | 0.41 |
7 | PlanTL-GOB-ES-roberta-base-bne | 0.521 | 0.74 | 0.56 | 0.81 | 0.42 | 0.12 | 0.75 | 0.63 | 0.40 | 0.37 | 0.41 |
8 | bertin-roberta-base-spanish | 0.493 | 0.73 | 0.49 | 0.76 | 0.36 | 0.08 | 0.75 | 0.62 | 0.39 | 0.33 | 0.42 |
9 | PlanTL-GOB-ES-roberta-large-bne | 0.552 | 0.75 | 0.57 | 0.82 | 0.44 | 0.24 | 0.82 | 0.64 | 0.40 | 0.38 | 0.46 |
10 | xlm-roberta-large | 0.564 | 0.77 | 0.56 | 0.82 | 0.47 | 0.26 | 0.84 | 0.64 | 0.42 | 0.40 | 0.46 |
# | System | Arithmetic mean | EXIST 2022: Sexism detection (EN) | EXIST 2022: Sexism categorisation (EN) | DIANN 2023: Disability detection (EN) | DIPROMATS 2023: Propaganda identification (EN) | DIPROMATS 2023: Coarse propaganda characterization (EN) | DIPROMATS 2023: Fine-grained propaganda characterization (EN) | EXIST-2023: Sexism categorization (EN) | EXIST-2023: Sexism identification (EN) | EXIST-2023: Source intention (EN) | SQAC-SQUAD 2024: Question answering (EN) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | bert-base-multilingual-cased | 0.501 | 0.76 | 0.50 | 0.73 | 0.80 | 0.48 | 0.18 | 0.34 | 0.60 | 0.32 | 0.30 |
2 | distilbert-base-multilingual-cased | 0.472 | 0.74 | 0.53 | 0.68 | 0.77 | 0.45 | 0.16 | 0.30 | 0.58 | 0.31 | 0.20 |
3 | distilbert-base-uncased | 0.497 | 0.77 | 0.55 | 0.66 | 0.78 | 0.47 | 0.14 | 0.37 | 0.62 | 0.34 | 0.27 |
4 | bert-base-cased | 0.513 | 0.76 | 0.53 | 0.72 | 0.81 | 0.50 | 0.21 | 0.37 | 0.61 | 0.32 | 0.30 |
5 | ixambert-base-cased | 0.503 | 0.75 | 0.53 | 0.73 | 0.78 | 0.49 | 0.14 | 0.36 | 0.61 | 0.32 | 0.32 |
6 | xlm-roberta-base | 0.517 | 0.76 | 0.53 | 0.76 | 0.80 | 0.54 | 0.16 | 0.35 | 0.62 | 0.32 | 0.33 |
7 | roberta-base | 0.530 | 0.78 | 0.53 | 0.75 | 0.81 | 0.52 | 0.19 | 0.38 | 0.63 | 0.33 | 0.38 |
8 | xlm-roberta-large | 0.565 | 0.79 | 0.56 | 0.78 | 0.81 | 0.52 | 0.39 | 0.39 | 0.63 | 0.36 | 0.42 |
9 | roberta-large | 0.587 | 0.81 | 0.58 | 0.79 | 0.82 | 0.55 | 0.47 | 0.40 | 0.64 | 0.35 | 0.46 |
ODESIA Extended Tasks
# | System | Arithmetic mean | MLDOC 2018: Document classification (ES) | Multilingual Complex Named Entity Recognition 2022 (ES) | SQAC-SQUAD 2016: Question answering (ES) | Semantic Textual Similarity 2017 (ES) |
---|---|---|---|---|---|---|
1 | ixambert-base-cased | 0.778 | 0.96 | 0.63 | 0.71 | 0.81 |
2 | bertin-roberta-base-spanish | 0.745 | 0.96 | 0.62 | 0.73 | 0.67 |
3 | distilbert-base-multilingual-cased | 0.698 | 0.94 | 0.61 | 0.55 | 0.69 |
4 | bert-base-multilingual-cased | 0.753 | 0.96 | 0.64 | 0.71 | 0.70 |
5 | xlm-roberta-base | 0.753 | 0.95 | 0.66 | 0.67 | 0.73 |
6 | distillbert-base-spanish-uncased | 0.710 | 0.96 | 0.61 | 0.53 | 0.74 |
7 | PlanTL-GOB-ES-roberta-base-bne | 0.773 | 0.96 | 0.64 | 0.74 | 0.75 |
8 | PlanTL-GOB-ES-roberta-large-bne | 0.780 | 0.96 | 0.63 | 0.77 | 0.76 |
9 | bert-base-spanish-wwm-cased | 0.773 | 0.96 | 0.63 | 0.71 | 0.79 |
10 | xlm-roberta-large | 0.810 | 0.96 | 0.71 | 0.77 | 0.80 |
# | System | Arithmetic mean | MLDOC 2018: Document classification (EN) | Multilingual Complex Named Entity Recognition 2022 (EN) | SQAC-SQUAD 2016: Question answering (EN) | Semantic Textual Similarity 2017 (EN) |
---|---|---|---|---|---|---|
1 | bert-base-multilingual-cased | 0.813 | 0.97 | 0.67 | 0.81 | 0.80 |
2 | ixambert-base-cased | 0.813 | 0.98 | 0.65 | 0.80 | 0.82 |
3 | distilbert-base-multilingual-cased | 0.778 | 0.97 | 0.63 | 0.75 | 0.76 |
4 | xlm-roberta-base | 0.818 | 0.98 | 0.69 | 0.80 | 0.80 |
5 | distilbert-base-uncased | 0.805 | 0.97 | 0.67 | 0.77 | 0.81 |
6 | bert-base-cased | 0.813 | 0.97 | 0.68 | 0.78 | 0.82 |
7 | roberta-base | 0.845 | 0.98 | 0.70 | 0.85 | 0.85 |
8 | roberta-large | 0.868 | 0.98 | 0.75 | 0.88 | 0.86 |
9 | xlm-roberta-large | 0.855 | 0.98 | 0.74 | 0.86 | 0.84 |
Check all the results on the Leaderboard
Spanish-English Gap
The overall gap between Spanish and English is 16%.
ODESIA Core Tasks
Tasks | Best result in Spanish | Best result in English | Gap |
---|---|---|---|
Overall mean | 0.60 | 0.60 | 14% |
EXIST 2022: Sexism detection (ES) | 0.77 | 0.81 | 17% |
EXIST 2022: Sexism categorisation (ES) | 0.57 | 0.58 | 10% |
DIPROMATS 2023: Propaganda identification (ES) | 0.82 | 0.82 | 11% |
DIPROMATS 2023: Coarse propaganda characterization (ES) | 0.47 | 0.55 | 48% |
DIPROMATS 2023: Fine-grained propaganda characterization (ES) | 0.26 | 0.47 | 299% |
DIANN 2023: Disability detection (ES) | 0.84 | 0.79 | 1% |
EXIST-2023: Sexism identification (ES) | 0.64 | 0.64 | 10% |
EXIST-2023: Source Intention (ES) | 0.42 | 0.36 | -4% |
EXIST-2023: Sexism categorization (ES) | 0.40 | 0.40 | 12% |
SQAC-SQUAD 2024: Question answering (ES) | 0.46 | 0.46 | 19% |
ODESIA Extended Tasks
Tasks | Best result in Spanish | Best result in English | Gap |
---|---|---|---|
Overall mean | 0.81 | 0.87 | 20.75% |
MLDOC 2018: Document classification (ES) | 0.96 | 0.98 | 40% |
Multilingual Complex Named Entity Recognition 2022 (ES) | 0.71 | 0.75 | 5% |
SQAC-SQUAD 2016: Question answering (ES) | 0.77 | 0.88 | 25% |
Semantic Textual Similarity 2017 (ES) | 0.81 | 0.86 | 13% |